wineQualityReds

Univariate plot section:

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## Warning in data(wines): data set 'wines' not found
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## NULL
##  [1]  7.4  7.8 11.2  7.9  7.3  7.5  6.7  5.6  8.9  8.5  8.1  7.6  6.9  6.3
## [15]  7.1  8.3  5.2  5.7  8.8  6.8  4.6  7.7  8.7  6.4  6.6  8.6 10.2  7.0
## [29]  7.2  9.3  8.0  9.7  6.2  5.0  4.7  8.4 10.1  9.4  9.0  8.2  6.1  5.8
## [43]  9.2 11.5  5.4  9.6 12.8 11.0 11.6 12.0 15.0 10.8 11.1 10.0 12.5 11.8
## [57] 10.9 10.3 11.4  9.9 10.4 13.3 10.6  9.8 13.4 10.7 11.9 12.4 12.2 13.8
## [71]  9.1 13.5 10.5 12.6 14.0 13.7  9.5 12.7 12.3 15.6  5.3 11.3 13.0  6.5
## [85] 12.9 14.3 15.5 11.7 13.2 15.9 12.1  5.1  4.9  5.9  6.0  5.5
## [1] 5 6 7 4 8 3
##  [1]  1.90  2.60  2.30  1.80  1.60  1.20  2.00  6.10  3.80  3.90  1.70
## [12]  4.40  2.40  1.40  2.50 10.70  5.50  2.10  1.50  5.90  2.80  2.20
## [23]  3.00  3.40  5.10  4.65  1.30  7.30  7.20  2.90  2.70  5.60  3.10
## [34]  3.20  3.30  3.60  4.00  7.00  6.40  3.50 11.00  3.65  4.50  4.80
## [45]  2.95  5.80  6.20  4.20  7.90  3.70  6.70  6.60  2.15  5.20  2.55
## [56] 15.50  4.10  8.30  6.55  4.60  4.30  5.15  6.30  6.00  8.60  7.50
## [67]  2.25  4.25  2.85  3.45  2.35  2.65  9.00  8.80  5.00  1.65  2.05
## [78]  0.90  8.90  8.10  4.70  1.75  7.80 12.90 13.40  5.40 15.40  3.75
## [89] 13.80  5.70 13.90
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

As for quality, slightly more than half of the wines were rated above the average, since the median (6) is higher than the mean (5.636). For most other variables the median is below the mean, which points to right-skewed distributions. This is most notable for total.sulfur.dioxide, which becomes evident in smell and taste above 50 ppm: the median (38) is substantially below the mean (46.47), yet 25% of wines are at or above 62 ppm. For most attributes except density, pH and, to some extent, alcohol, the spread across the quartiles is wide, especially between the minimum and the maximum, which may be due to outliers.
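The mean-versus-median comparison above can be made explicit with a small sketch. The numbers are copied by hand from the `summary()` output; the data frame `stats` and the column `rel_gap` are names introduced here for illustration only.

```r
# Means and medians copied from the summary() output above
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42, 5.636),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14,
             38, 0.9968, 3.31, 0.62, 10.2, 6),
  stringsAsFactors = FALSE
)

# Relative gap: positive when the mean sits above the median (right skew)
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gap first
stats <- stats[order(stats$rel_gap, decreasing = TRUE), ]
print(stats, row.names = FALSE)
```

With these inputs total.sulfur.dioxide has the largest relative gap, and only density and quality have a mean below the median.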

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

How does the distribution of total.sulfur.dioxide differ across quality levels? According to the description of the data set, there might be a relationship between the two. I also wonder how the other variables affect quality. The table above shows the number of wines at each quality score.
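As a quick check, the quality summary can be reproduced from the table of counts alone. This is a minimal sketch that rebuilds the quality vector from the counts printed above instead of reading the data file; `counts`, `quality` and `share_5_6` are names introduced here.

```r
# Counts per quality score, copied from the table above
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild one quality value per wine
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)                       # number of wines
summary(quality)                      # same statistics as the summary above
round(prop.table(table(quality)), 3)  # share of wines at each score

# Share of wines scored 5 or 6
share_5_6 <- mean(quality %in% c(5, 6))
share_5_6
```

On the full data frame the same table would come from something like `table(wines$quality)`, assuming the data frame is called `wines` as in the output.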

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

Let’s see which alcohol degree is the most common.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1

A large number of wines fall between 9 and 10 degrees of alcohol, and the median is 10.2. I am including the table for alcohol because alcohol has the strongest correlation with quality, and the table shows the number of wines at each alcohol level. The most common value is 9.5 (139 wines), followed by 9.4 (103 wines).
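Reading the most common value off a long table by eye is error-prone, so a short sketch of how to extract it may help. Only a handful of counts from the table above are copied here; `counts` and `most_common` are illustrative names.

```r
# A few of the alcohol counts from the table above (not the full table)
counts <- c(`9.2` = 72, `9.4` = 103, `9.5` = 139, `9.8` = 78,
            `10` = 67, `10.5` = 67)

# Alcohol level with the highest count
most_common <- names(counts)[which.max(counts)]
most_common

# Three most frequent levels, largest first
sort(counts, decreasing = TRUE)[1:3]

# On the full data this would be, assuming the data frame is named wines:
# names(which.max(table(wines$alcohol)))
```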

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

For fixed acidity, the median is 7.90 and the mean is higher (8.32) because of outliers on the high end.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

I will create a new variable called total acidity, and I wonder whether it has a direct correlation with quality.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I wonder if there is any connection between the percentage of alcohol and quality.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The middle half of the wines have between 9.50% and 11.10% alcohol. The median is 10.2%.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

132 wines in the data set have 0 citric acid. According to the description of the data set, citric acid can add freshness and flavour to wines. I wonder if it has any effect on the variable “quality” in this data set and how the two might be connected. The median (0.26) is roughly three times the first quartile (0.09), which shows that a large number of wines have a very low amount of citric acid.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

There is a huge difference between the maximum (15.5) and the 3rd quartile (2.6) for residual sugar, which shows that there are outliers at the high end of the range. Using scale_y_log10 sheds light on the outliers, and scale_x_log10 shows a distribution that is closer to normal (bell shaped).
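The effect of the log transformation can be sketched without the plots. The data below are simulated from a log-normal distribution purely for illustration (they are not the wine data), and base R's `log10()` stands in for the axis transformation that ggplot2's `scale_x_log10()` applies.

```r
set.seed(42)

# Simulated right-skewed values (hypothetical, not the wine data)
x <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.4)

# On the raw scale the long right tail pulls the mean above the median
raw_gap <- mean(x) - median(x)

# After log10 the distribution is roughly symmetric, so the gap shrinks
log_gap <- mean(log10(x)) - median(log10(x))

c(raw_gap = raw_gap, log_gap = log_gap)

# Bin the transformed values; drop plot = FALSE to draw the histogram
h <- hist(log10(x), breaks = 30, plot = FALSE)
head(h$counts)
```

A log scale on the y axis (`scale_y_log10()` in ggplot2) serves a different purpose: it leaves the x values alone and makes bins with very small counts, such as outliers, visible.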

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Again with chlorides we see outliers to the right. I transformed the long-tailed data to understand it better.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another transformation, this time across the y axis.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

total.sulfur.dioxide seems to be another factor that might have a negative effect on smell and taste, especially if it is over 50.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The difference between the mean and the median is larger than for many other variables: the median is 38 and the mean is 46.47. There are only 9 samples between 150 and 289.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Again the data is skewed in the case of free.sulfur.dioxide, and I have to apply a log transformation in order to see the distribution. The mean is 15.87 and the median is 14 for free.sulfur.dioxide.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is also right-skewed. There are outliers, but the difference between the quartiles is not as stark.

## [1] 1599
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The distribution of density is close to normal, with the first quartile, median, mean and third quartile very close to each other.

Univariate analysis

What is the structure of your dataset?

There are 1599 observations (red wine samples) in the data set and 13 columns: a row index (X), 11 physicochemical features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and the quality score. Most of the features, except for density, pH and quality, are right-skewed and have some extreme outliers to the right.

Other observations:

As for quality, slightly more than half of the wines were rated above the average, since the median (6) is higher than the mean (5.636). For most other variables the median is below the mean. This is most notable for total.sulfur.dioxide, which becomes evident in smell and taste above 50 ppm: the median (38) is substantially below the mean (46.47), yet 25% of wines are at or above 62 ppm. For most attributes except density, pH and, to some extent, alcohol, the spread across the quartiles is wide, especially between the minimum and the maximum, which may be due to outliers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in my data set is quality. I would like to know which features affected the experts' determination of quality. I suspect total.sulfur.dioxide, residual.sugar, volatile.acidity and citric.acid would have the most effect.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

total.sulfur.dioxide, residual.sugar, volatile.acidity and citric.acid are the features I am most interested in, but looking into other features, or a combination of some of them, might help in investigating the data set effectively and in building a model.

Did you create any new variables from existing variables in the dataset?

I created a new feature called total.acidity which is the sum of fixed.acidity and volatile.acidity. I will have to examine if it has any connection to the quality and if it improves building a model.
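A minimal sketch of that derived variable, using only the first five rows shown in the `str()` output above. The data frame name `wines_head` is introduced here for illustration; on the full data the same line would be applied to `wines`.

```r
# First five rows of the two acidity columns, copied from the str() output
wines_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70)
)

# New feature: sum of fixed and volatile acidity
wines_head$total.acidity <- wines_head$fixed.acidity + wines_head$volatile.acidity

wines_head
```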

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the features were right-skewed, and I log-transformed them to get a better sense of the data. In the case of total.sulfur.dioxide, this was done on the y axis, and in the case of residual.sugar it was done on both axes separately, as it is both right-skewed and has a wide range of outliers.

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

For quality, the strongest positive correlation is with alcohol (0.48), with weaker positive correlations for sulphates (0.25) and citric acid (0.23). There is a negative correlation between quality and volatile acidity (-0.39) and weak negative correlations with total sulfur dioxide (-0.19) and chlorides (-0.13). There are also strong correlations between density and fixed acidity (0.67) and between pH and fixed acidity (-0.68).
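To rank the features by how strongly they correlate with quality, the quality column of the matrix can be sorted by absolute value. The sketch below copies that column by hand; `cor_quality` and `ranked` are illustrative names. The row index X is left out, since it only numbers the rows.

```r
# Correlations with quality, copied from the matrix above (X omitted)
cor_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Strongest relationship first, regardless of sign
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
round(ranked, 2)

# On the full data, assuming the data frame is named wines:
# cor(wines[, names(wines) != "X"])[, "quality"]
```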

Using scatterplots to see the relationships between fixed.acidity, pH, density and citric acid.

## Warning: Removed 49 rows containing missing values (geom_point).

As citric acid increases, the variation in fixed acidity increases. The relationship between the two seems to be linear.

Above we can see the linear relation between the two variables more clearly and also the increase of variation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## Warning: position_stack requires non-overlapping x intervals

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

It seems that most of the higher-quality wines have a higher level of citric acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## Warning: position_stack requires non-overlapping x intervals

There is a correlation between the amount of alcohol and quality, and there is no low-alcohol wine with high quality. Let’s see them in numbers.

## Warning: Removed 1 rows containing non-finite values (stat_summary).

## 
##              9.2              9.5              9.7              9.8 
##                2                2                2                2 
##              9.9               10             10.1             10.2 
##                4                9                2                4 
##             10.3             10.4             10.5            10.55 
##                1                1               10                1 
##             10.6             10.7             10.8             10.9 
##                6                1               11                5 
##               11             11.1             11.2             11.3 
##               13                4               10                8 
##             11.4             11.5             11.6             11.7 
##                3                6                6               13 
##             11.8             11.9               12             12.1 
##               11                5                9                8 
##             12.2             12.3             12.4             12.5 
##                4                7                6               10 
##             12.6             12.7             12.8             12.9 
##                3                3                8                4 
##               13             13.1             13.3             13.4 
##                2                1                1                2 
## 13.5666666666667             13.6               14 
##                1                3                3

Wines with higher alcohol usually have higher quality.

Most wines have a quality of 5 or 6.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The highest-quality wines (8) have the highest median alcohol (12.15). The lowest median (9.7) belongs to wines scored 5, not to the lowest-quality wines scored 3 (9.925); from quality 5 upwards the median rises steadily.
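Per-quality summaries like the ones above can be produced with a grouped statistic. This is a small sketch on the first ten rows from the `str()` output only, so the medians differ from the full-data values; `wines_head` and `med` are names introduced here.

```r
# First ten rows of alcohol and quality, copied from the str() output
wines_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol within each quality score
med <- tapply(wines_head$alcohol, wines_head$quality, median)
med

# Quality score with the lowest median alcohol in this small sample
names(med)[which.min(med)]
```

The full tables in the output have the shape that `by(wines$alcohol, wines$quality, summary)` produces.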

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.162   3.230   3.267   3.350   3.720

I see a weak trend towards more acidic wines (lower pH) having a higher quality score. Although the correlation is very weak (-0.06), the median pH for wines with quality 8 is the lowest (3.23).

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

I see a relatively stronger, negative correlation between volatile acidity and quality.

There seems to be a positive correlation between the two variables citric acid and density, but their correlations with quality seem to be opposite to one another.

## Warning: Removed 162 rows containing non-finite values (stat_smooth).
## Warning: Removed 162 rows containing missing values (geom_point).

The relationship between citric.acid and density seems to be linear, but it is weak and the data points are very dispersed (there is a big variation).

## 
## Calls:
## m1: lm(formula = density ~ citric.acid, data = subset(wines, citric.acid > 
##     0 & citric.acid <= quantile(wines$citric.acid, 0.999)))
## 
## =============================
##   (Intercept)      0.996***  
##                   (0.000)    
##   citric.acid      0.004***  
##                   (0.000)    
## -----------------------------
##   R-squared            0.1   
##   adj. R-squared       0.1   
##   sigma                0.0   
##   F                  210.8   
##   p                    0.0   
##   Log-likelihood    7229.2   
##   Deviance             0.0   
##   AIC             -14452.3   
##   BIC             -14436.5   
##   N                 1465     
## =============================

The model that uses citric.acid to explain density explains only about 10% of the variance, which is very little.

There is a correlation between density and fixed.acidity: the higher the fixed.acidity, the higher the density.

## 
## Calls:
## m2: lm(formula = quality ~ alcohol, data = subset(wines, alcohol > 
##     0 & alcohol <= quantile(wines$alcohol, 0.999)))
## 
## =============================
##   (Intercept)      1.818***  
##                   (0.175)    
##   alcohol          0.366***  
##                   (0.017)    
## -----------------------------
##   R-squared            0.2   
##   adj. R-squared       0.2   
##   sigma                0.7   
##   F                  480.4   
##   p                    0.0   
##   Log-likelihood   -1715.4   
##   Deviance           800.7   
##   AIC               3436.8   
##   BIC               3452.9   
##   N                 1598     
## =============================

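Despite a correlation of 0.48 between alcohol and quality, the model explains only about 20% of the variance in quality. This is what a correlation of that size implies, because in a one-predictor linear model R-squared equals the squared correlation (0.476² ≈ 0.23).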
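The link between the correlation and the variance explained by a one-predictor model can be checked directly. The sketch below uses simulated data (the coefficients are only loosely inspired by the model above and are assumptions), and shows that `summary(fit)$r.squared` matches the squared Pearson correlation.

```r
set.seed(1)

# Simulated predictor and response (hypothetical, not the wine data)
n <- 200
alcohol <- runif(n, min = 8.4, max = 14.9)
quality <- 1.8 + 0.37 * alcohol + rnorm(n, sd = 0.7)

fit <- lm(quality ~ alcohol)

r  <- cor(alcohol, quality)     # Pearson correlation
r2 <- summary(fit)$r.squared    # share of variance explained by the model

# In simple linear regression the two agree
all.equal(r2, r^2)

# Squared correlation reported in the matrix above for alcohol and quality
0.47616632^2
```

A correlation near 0.48 therefore cannot give much more than 23% of variance explained with a single predictor.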

## 
## Calls:
## m3: lm(formula = quality ~ volatile.acidity, data = wines)
## 
## ===============================
##   (Intercept)        6.566***  
##                     (0.058)    
##   volatile.acidity  -1.761***  
##                     (0.104)    
## -------------------------------
##   R-squared              0.2   
##   adj. R-squared         0.2   
##   sigma                  0.7   
##   F                    287.4   
##   p                      0.0   
##   Log-likelihood     -1794.3   
##   Deviance             883.2   
##   AIC                 3594.6   
##   BIC                 3610.8   
##   N                   1599     
## ===============================

Only about 15% of the variance is explained here: the squared correlation is (-0.39)² ≈ 0.15, which the table rounds to 0.2. Perhaps I should add more features to the model in the next part.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a moderate negative correlation between quality and volatile.acidity. There is a stronger correlation between quality and alcohol, and weaker ones with citric.acid, sulphates and density.

There are, as one would expect, stronger correlations between features that are related, such as fixed.acidity and pH (pH is a measurement of acidity).

Wines with higher amounts of citric acid, alcohol and sulphates are likelier to have a higher quality, and the correlation with volatile.acidity seems to be negative.

Most wines have a quality of 5 or 6 (about 82%, 1319 of 1599).

The variation of all features is large, and the correlations are weak, except between features that are related by nature, such as acidity and pH. The scatter plots are also widely scattered.

Wines with higher citric acid seem to have a higher density.

Using R-squared to explain the variance in quality based on one feature does not seem to give a good result. In the next section I will use more than one feature and see if there is any improvement.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a correlation between free.sulfur.dioxide and total.sulfur.dioxide (0.67), which is understandable because one is a subset of the other. There is also one between citric.acid and density (0.36), and an even stronger one between density and fixed.acidity (0.67).

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and pH (-0.68): the higher the fixed.acidity, the lower the pH. There is also a strong correlation between density and fixed.acidity (0.67). None of these features has a very strong relationship with quality.
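Finding the strongest pair can also be done programmatically. The sketch below works on a four-variable excerpt of the correlation matrix, with values rounded to four decimals from the output above; `m`, `off`, `idx` and `strongest` are illustrative names.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Excerpt of the correlation matrix above, rounded to four decimals
m <- matrix(c(
   1.0000,  0.6717,  0.6680, -0.6830,
   0.6717,  1.0000,  0.3649, -0.5419,
   0.6680,  0.3649,  1.0000, -0.3417,
  -0.6830, -0.5419, -0.3417,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the lower triangle so each pair is considered once
off <- abs(m)
off[upper.tri(off, diag = TRUE)] <- 0

# Position of the largest absolute correlation
idx <- which(off == max(off), arr.ind = TRUE)
strongest <- c(rownames(m)[idx[1, "row"]], colnames(m)[idx[1, "col"]])

strongest
m[idx[1, "row"], idx[1, "col"]]
```

The same steps applied to the full matrix from `cor()` would scan every pair, not just these four variables.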

Multivariate Plots Section:

I made the second plot with only the highest and lowest quality to make the distinction clearer. The first plot is for all quality levels. Comparing the lowest quality with the highest, it seems that for the same amount of sulphates the higher-quality wines have lower pH.

As expected with alcohol, for the same amount of sulphates, the higher the alcohol, the higher the quality seems to be.

The general trend seems to be that wines with higher volatile.acidity have lower quality, which corresponds with the correlation results. For higher qualities, higher alcohol seems to compensate for higher volatile.acidity.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04975 0.35250 1.56300 3.21400 5.94000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.882   1.757   2.700   9.400 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.950   2.185   2.412   3.572  10.270 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.972   2.764   2.923   4.654   9.112 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.310   4.720   4.288   5.685   9.880 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.420   3.375   5.116   4.624   6.160   8.978

The product of the two positively correlated features seems to show their effect on quality more clearly.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.360   5.171   5.320   5.637   5.681   8.514 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.003   4.753   5.871   6.092   6.612  18.800 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.705   5.130   5.723   6.145   6.615  19.400 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.171   5.985   6.763   7.171   8.033  19.300 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.563   7.326   8.378   8.493   9.564  13.560 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.150   8.045   9.322   9.257  10.520  11.480

A correlation appears here as well.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00245 0.01930 0.11220 0.23840 0.37620 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0144  0.0456  0.1290  0.1408  2.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0510  0.1292  0.1602  0.2310  0.9576 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0567  0.1680  0.1925  0.2965  0.9044 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2008  0.2976  0.2837  0.3893  0.7344 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0246  0.2164  0.3424  0.2966  0.3675  0.5904

The median of the product of sulphates and citric.acid differs many-fold between quality 3 (0.0193) and quality 8 (0.3424), roughly an 18-fold increase.
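Per-quality summaries like the ones above can be produced with `by()`. A minimal sketch, assuming the product is a plain element-wise multiplication of the two columns; it uses only the first ten observations from the `str()` output of this report, so only qualities 5, 6 and 7 appear and the numbers differ from the full-data output. The names `wines10` and `sulph.citric` are hypothetical.

```r
# First ten observations, copied from the str(wines) output in this report.
# Assumption: a stand-in for the full data set, for illustration only.
wines10 <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Product of the two features (hypothetical column name)
wines10$sulph.citric <- wines10$sulphates * wines10$citric.acid

# Six-number summary of the product within each quality level
by(wines10$sulph.citric, wines10$quality, summary)

# Just the medians, which are easier to compare across quality levels
med_by_quality <- tapply(wines10$sulph.citric, wines10$quality, median)
med_by_quality
```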

As I expected, higher-quality wines tend to have higher alcohol, which overshadows the very weak effect of higher sulphates.

The general trend is that wines with higher sulphates, lower volatile.acidity and higher alcohol have higher quality.

## 
## Calls:
## m4: lm(formula = quality ~ alcohol, data = wines)
## m5: lm(formula = quality ~ alcohol + sulphates, data = wines)
## m6: lm(formula = quality ~ alcohol + sulphates + citric.acid, data = wines)
## m7: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity, 
##     data = wines)
## m8: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density, data = wines)
## m9: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity, data = wines)
## m10: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides, data = wines)
## m11: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar, 
##     data = wines)
## m12: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar + 
##     total.sulfur.dioxide, data = wines)
## m13: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar + 
##     total.sulfur.dioxide + pH, data = wines)
## 
## ============================================================================================================================================
##                            m4         m5         m6         m7          m8          m9         m10         m11         m12         m13      
## --------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            1.875***   1.375***   1.434***   1.138***   62.356***   30.401*     31.514*     42.871*     47.404**    25.493     
##                         (0.175)    (0.177)    (0.176)    (0.214)    (15.472)    (15.163)    (15.111)    (17.775)    (17.729)    (21.142)    
##   alcohol                0.361***   0.346***   0.338***   0.346***    0.296***    0.298***    0.281***    0.271***    0.249***    0.275***  
##                         (0.017)    (0.016)    (0.016)    (0.016)     (0.021)     (0.020)     (0.020)     (0.022)     (0.023)     (0.026)    
##   sulphates                         0.994***   0.814***   0.821***    0.881***    0.732***    0.885***    0.905***    0.955***    0.929***  
##                                    (0.102)    (0.107)    (0.106)     (0.107)     (0.104)     (0.112)     (0.113)     (0.114)     (0.114)    
##   citric.acid                                  0.513***   0.312*      0.278*     -0.460***   -0.325*     -0.344*     -0.215      -0.231     
##                                               (0.093)    (0.125)     (0.125)     (0.137)     (0.141)     (0.142)     (0.145)     (0.145)    
##   fixed.acidity                                           0.033*      0.076***    0.077***    0.070***    0.078***    0.066***    0.031     
##                                                          (0.013)     (0.017)     (0.017)     (0.017)     (0.018)     (0.018)     (0.026)    
##   density                                                           -61.296***  -28.268     -29.221     -40.621*    -44.859*    -21.594     
##                                                                     (15.490)    (15.198)    (15.145)    (17.822)    (17.772)    (21.575)    
##   volatile.acidity                                                               -1.302***   -1.195***   -1.192***   -1.125***   -1.124***  
##                                                                                  (0.116)     (0.119)     (0.119)     (0.120)     (0.120)    
##   chlorides                                                                                  -1.444***   -1.470***   -1.646***   -1.825***  
##                                                                                              (0.408)     (0.408)     (0.409)     (0.419)    
##   residual.sugar                                                                                          0.017       0.029*      0.020     
##                                                                                                          (0.014)     (0.014)     (0.015)    
##   total.sulfur.dioxide                                                                                               -0.002***   -0.002***  
##                                                                                                                      (0.001)     (0.001)    
##   pH                                                                                                                             -0.361     
##                                                                                                                                  (0.190)    
## --------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.3        0.3        0.3        0.3         0.3         0.4         0.4         0.4         0.4    
##   adj. R-squared             0.2        0.3        0.3        0.3        0.3         0.3         0.3         0.3         0.4         0.4    
##   sigma                      0.7        0.7        0.7        0.7        0.7         0.7         0.7         0.7         0.6         0.6    
##   F                        468.3      295.0      210.5      159.8      132.2       140.0       122.6       107.5        98.2        88.9    
##   p                          0.0        0.0        0.0        0.0        0.0         0.0         0.0         0.0         0.0         0.0    
##   Log-likelihood         -1721.1    -1675.1    -1660.0    -1657.0    -1649.2     -1587.9     -1581.6     -1580.9     -1573.0     -1571.2    
##   Deviance                 805.9      760.9      746.6      743.9      736.6       682.2       676.9       676.3       669.6       668.1    
##   AIC                     3448.1     3358.3     3329.9     3326.1     3312.5      3191.8      3181.2      3181.8      3168.0      3166.3    
##   BIC                     3464.2     3379.8     3356.8     3358.4     3350.1      3234.8      3229.6      3235.5      3227.1      3230.9    
##   N                       1599       1599       1599       1599       1599        1599        1599        1599        1599        1599      
## ============================================================================================================================================

This seems to be a poor model: even with many features included, the maximum R-squared reached is 0.4.

Multivariate Analysis:

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There are obvious correlations between pH and fixed.acidity, and between free.sulfur.dioxide and total.sulfur.dioxide. Although there is a weak correlation between quality and sulphates, citric.acid and chlorides, many data points do not seem to follow any relationship between the features. For instance, there are a lot of fluctuations in the line plot of sulphates vs. alcohol for the different qualities.

Were there any interesting or surprising interactions between features?

There were a couple, namely the relationship between alcohol and density: wines with higher alcohol seem to have, on average, a lower density, and there is a very weak negative correlation between density and quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model of quality on alcohol. Alcohol alone explained only 0.2 of the variance in quality. Adding more features raised the R-squared to 0.4.
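The effect of adding a feature on R-squared can be sketched with two nested `lm()` fits that reuse the m4 and m5 formulas from the table above. The data here are only the first ten observations from the `str()` output of this report (an assumption made so the example is self-contained), so the R-squared values will not match the reported 0.2 to 0.4.

```r
# First ten observations, copied from the str(wines) output in this report.
# Assumption: a stand-in for the full data set, for illustration only.
wines10 <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Nested models: the second adds sulphates to the first
m4 <- lm(quality ~ alcohol, data = wines10)
m5 <- lm(quality ~ alcohol + sulphates, data = wines10)

# R-squared never decreases when a predictor is added to a nested model;
# adjusted R-squared penalises the extra term and can go down
r2     <- c(m4 = summary(m4)$r.squared,     m5 = summary(m5)$r.squared)
adj_r2 <- c(m4 = summary(m4)$adj.r.squared, m5 = summary(m5)$adj.r.squared)
round(rbind(r2, adj_r2), 3)
```

Comparing adjusted R-squared (or AIC and BIC, as in the table above) is a fairer way to judge whether an added feature is worth keeping.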

Final Plots and Summary

Plot one:

Description 1:

The majority of samples are of quality 5 or 6.

Plot two:

Description 2:

The largest correlation is seen between density and volatile.acidity. Higher alcohol seems to correspond to higher quality as well.

Plot three:

Description 3:

Facet-wrapping the wines by quality, filling by alcohol and using volatile.acidity as the x axis shows that the higher qualities contain more wines with higher alcohol, and that counts for wines with higher alcohol are generally higher. It also shows that wines with higher quality have lower volatile acidity. Among wines with a quality of 3, there is no sample with alcohol higher than 11.

Reflection:

The red wine dataset contains 1599 observations with 13 features. Except for one feature, quality, the rest are measurable chemical properties of the wine. Quality is an abstract, non-measurable feature that is sensory and based on experts' opinions. I started by examining each feature and looking at its distribution, which for the most part was normal. Then I tried to find the relationships between different features, and especially between each feature and the outcome feature, quality. According to the correlation figures obtained with cor(wines), the features that are correlated with quality are alcohol, volatile.acidity, sulphates and citric.acid, most of which are only weakly correlated. Further observations more or less confirmed these relationships. A few features are related by nature and definition, such as pH and citric.acid, as pH is a measure of acidity. I tried to fit a linear model to the data. The best outcome was a model using most of the variables that accounted for 40% of the variance in quality. Although this is not a very high rate, given the sensory nature of the outcome variable it is a good starting point for building more accurate models. Two options are building non-linear models that can, for instance, include polynomial terms of different orders, and collecting more data or looking into more complete datasets. Using a classification model instead of regression might also be a good choice, as predicting wine quality in this context seems to be more of a classification problem.
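As one possible direction for the non-linear models mentioned above, a polynomial term can be added with `poly()`. This is only a sketch on the first ten observations from the `str()` output of this report, not a model that was fitted in the analysis; the names `wines10`, `m_lin` and `m_quad` are hypothetical.

```r
# First ten observations, copied from the str(wines) output in this report.
# Assumption: a stand-in for the full data set, for illustration only.
wines10 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Straight-line fit versus a fit with an added quadratic term in alcohol
m_lin  <- lm(quality ~ alcohol, data = wines10)
m_quad <- lm(quality ~ poly(alcohol, 2), data = wines10)

r2 <- c(linear    = summary(m_lin)$r.squared,
        quadratic = summary(m_quad)$r.squared)
round(r2, 3)

# F-test for whether the quadratic term improves on the linear model
anova(m_lin, m_quad)
```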

======= wineQualityReds

Univariate plot section:

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## Warning in data(wines): data set 'wines' not found
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## NULL
##  [1]  7.4  7.8 11.2  7.9  7.3  7.5  6.7  5.6  8.9  8.5  8.1  7.6  6.9  6.3
## [15]  7.1  8.3  5.2  5.7  8.8  6.8  4.6  7.7  8.7  6.4  6.6  8.6 10.2  7.0
## [29]  7.2  9.3  8.0  9.7  6.2  5.0  4.7  8.4 10.1  9.4  9.0  8.2  6.1  5.8
## [43]  9.2 11.5  5.4  9.6 12.8 11.0 11.6 12.0 15.0 10.8 11.1 10.0 12.5 11.8
## [57] 10.9 10.3 11.4  9.9 10.4 13.3 10.6  9.8 13.4 10.7 11.9 12.4 12.2 13.8
## [71]  9.1 13.5 10.5 12.6 14.0 13.7  9.5 12.7 12.3 15.6  5.3 11.3 13.0  6.5
## [85] 12.9 14.3 15.5 11.7 13.2 15.9 12.1  5.1  4.9  5.9  6.0  5.5
## [1] 5 6 7 4 8 3
##  [1]  1.90  2.60  2.30  1.80  1.60  1.20  2.00  6.10  3.80  3.90  1.70
## [12]  4.40  2.40  1.40  2.50 10.70  5.50  2.10  1.50  5.90  2.80  2.20
## [23]  3.00  3.40  5.10  4.65  1.30  7.30  7.20  2.90  2.70  5.60  3.10
## [34]  3.20  3.30  3.60  4.00  7.00  6.40  3.50 11.00  3.65  4.50  4.80
## [45]  2.95  5.80  6.20  4.20  7.90  3.70  6.70  6.60  2.15  5.20  2.55
## [56] 15.50  4.10  8.30  6.55  4.60  4.30  5.15  6.30  6.00  8.60  7.50
## [67]  2.25  4.25  2.85  3.45  2.35  2.65  9.00  8.80  5.00  1.65  2.05
## [78]  0.90  8.90  8.10  4.70  1.75  7.80 12.90 13.40  5.40 15.40  3.75
## [89] 13.80  5.70 13.90
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

For quality, more than half of the wines were rated above the mean, since the median (6) is higher than the mean (5.636). For most other variables the median is below the mean, most notably total.sulfur.dioxide, which becomes evident in smell and taste above 50 ppm: its median (38) is substantially below its mean (46.47), yet 25% of wines still have 62 ppm or more. For most attributes except density, pH and, to some extent, alcohol, the spread across the quartiles is wide, especially between the minimum and the maximum, which may be due to outliers.
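The median-versus-mean comparison can be made explicit by tabulating the two statistics side by side. A minimal sketch, with the values typed in from the `summary()` output above rather than computed from the CSV; the name `stats` is only illustrative:

```r
# Median and mean per variable, copied from the summary(wines) output above
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20, 6.000),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42, 5.636)
)

# A median below the mean suggests a longer tail on the high side
stats$median_below_mean <- stats$median < stats$mean

# Relative gap between mean and median, largest first
stats$rel_gap <- (stats$mean - stats$median) / stats$median
stats[order(-stats$rel_gap), ]
```

With the full data frame loaded, `sapply(wines, median)` and `sapply(wines, mean)` would give the same two columns directly.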

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

How does the distribution of total.sulfur.dioxide differ for different qualities? According to the description of the data set, there might be a relationship between the two. I wonder how other variables will affect the quality.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

Let’s see which alcohol degree is the most common.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1

A large number of wines fall between 9 and 10 degrees of alcohol. The median is 10.2.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## 
##  4.6  4.7  4.9    5  5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9    6  6.1 
##    1    1    1    6    4    6    4    5    1   14    2    4    9   13   16 
##  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9    7  7.1  7.2  7.3  7.4  7.5  7.6 
##   20   14   25   17   37   28   46   38   50   57   67   44   44   52   46 
##  7.7  7.8  7.9    8  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1 
##   49   53   42   42   26   45   40   26   19   27   24   34   33   26   29 
##  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9   10 10.1 10.2 10.3 10.4 10.5 10.6 
##   16   22   17   14   17    9   15   26   23   10   19   11   21   12   14 
## 10.7 10.8 10.9   11 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9   12 12.1 
##   10   10    8    3    9    5    7    5   13   12    3    3   12    7    1 
## 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.9   13 13.2 13.3 13.4 13.5 13.7 13.8 
##    4    5    4    7    4    4    5    2    3    3    3    1    1    2    1 
##   14 14.3   15 15.5 15.6 15.9 
##    1    1    2    2    2    1

For fixed acidity, the median is 7.90 and the mean (8.32) is higher because of outliers at the high end.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## 
##  0.12  0.16  0.18  0.19   0.2  0.21  0.22  0.23  0.24  0.25  0.26  0.27 
##     3     2    10     2     3     6     6     5    13     7    16    14 
##  0.28  0.29 0.295   0.3 0.305  0.31 0.315  0.32  0.33  0.34  0.35  0.36 
##    23    16     1    16     2    30     2    23    20    30    22    38 
## 0.365  0.37  0.38  0.39 0.395   0.4  0.41 0.415  0.42  0.43  0.44  0.45 
##     2    24    35    35     2    37    33     3    31    43    23    22 
##  0.46  0.47 0.475  0.48  0.49   0.5  0.51  0.52  0.53  0.54 0.545  0.55 
##    31    21     2    24    35    46    24    33    29    31     5    20 
##  0.56 0.565  0.57 0.575  0.58 0.585  0.59 0.595   0.6 0.605  0.61 0.615 
##    34     1    28     3    38     3    39     1    47     3    27     6 
##  0.62 0.625  0.63 0.635  0.64 0.645  0.65 0.655  0.66 0.665  0.67 0.675 
##    24     3    29     9    27    12    16     7    26     3    23     3 
##  0.68 0.685  0.69 0.695   0.7 0.705  0.71 0.715  0.72 0.725  0.73 0.735 
##    12    11    23     7    10     6     3    12     5     9     6     8 
##  0.74 0.745  0.75 0.755  0.76 0.765  0.77 0.775  0.78 0.785  0.79 0.795 
##    11     5     6     3     5     5     6     4    10     8     2     2 
##   0.8 0.805  0.81 0.815  0.82 0.825  0.83 0.835  0.84 0.845  0.85 0.855 
##     3     1     2     3     5     1     4     4     8     1     2     3 
##  0.86 0.865  0.87 0.875  0.88 0.885  0.89 0.895   0.9  0.91 0.915  0.92 
##     2     1     4     2     5     5     1     1     3     3     4     1 
## 0.935  0.95 0.955  0.96 0.965 0.975  0.98     1 1.005  1.01  1.02 1.025 
##     2     1     1     3     3     1     3     3     1     1     4     1 
## 1.035  1.04  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.33  1.58 
##     1     3     1     1     1     1     1     1     1     2     1

I will create a new variable called total acidity, and I wonder whether it has a direct correlation with quality.
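The report does not say how total acidity is defined. The sketch below assumes it is the sum of fixed.acidity and volatile.acidity, which is an assumption made only for illustration, and it uses the first ten observations from the `str()` output of this report instead of the full data set. The names `wines10` and `total.acidity` are hypothetical.

```r
# First ten observations, copied from the str(wines) output in this report.
# Assumption: a stand-in for the full data set, for illustration only.
wines10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed definition: total acidity = fixed acidity + volatile acidity
wines10$total.acidity <- wines10$fixed.acidity + wines10$volatile.acidity

# Correlation of the new variable with quality (ten rows only)
r_total_quality <- cor(wines10$total.acidity, wines10$quality)
r_total_quality
```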

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I wonder if there is any connection between the percentage of alcohol and the quality.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The middle 50% of wines have between 9.50% and 11.10% alcohol. The median is 10.2%.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

132 of the wines in the data set have 0 citric acid. As per the description of the data set, citric acid can add freshness and flavour to wines. I wonder if it has any effect on the variable “quality” in this data set and how the two might be connected. The median (0.260) is roughly three times the first quartile (0.090), which shows that a large number of wines have a very low amount of citric acid.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## 
##  0.9  1.2  1.3  1.4  1.5  1.6 1.65  1.7 1.75  1.8  1.9    2 2.05  2.1 2.15 
##    2    8    5   35   30   58    2   76    2  129  117  156    2  128    2 
##  2.2 2.25  2.3 2.35  2.4  2.5 2.55  2.6 2.65  2.7  2.8 2.85  2.9 2.95    3 
##  131    1  109    1   86   84    1   79    1   39   49    1   24    1   25 
##  3.1  3.2  3.3  3.4 3.45  3.5  3.6 3.65  3.7 3.75  3.8  3.9    4  4.1  4.2 
##    7   15   11   15    1    2    8    1    4    1    8    6   11    6    5 
## 4.25  4.3  4.4  4.5  4.6 4.65  4.7  4.8    5  5.1 5.15  5.2  5.4  5.5  5.6 
##    1    8    4    4    6    2    1    3    1    5    1    3    1    8    6 
##  5.7  5.8  5.9    6  6.1  6.2  6.3  6.4 6.55  6.6  6.7    7  7.2  7.3  7.5 
##    1    4    3    4    4    3    2    3    2    2    2    1    1    1    1 
##  7.8  7.9  8.1  8.3  8.6  8.8  8.9    9 10.7   11 12.9 13.4 13.8 13.9 15.4 
##    2    3    2    3    1    2    1    1    1    2    1    1    2    1    2 
## 15.5 
##    1

There is a huge difference between the maximum (15.5) and the 3rd quartile (2.6) for residual.sugar, which shows that there are outliers at the high end. Using scale_y_log10() sheds light on the outliers, and scale_x_log10() gives a more bell-shaped, roughly normal distribution.
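Why a log10 axis helps can be seen from the summary statistics alone. The sketch below compares the length of the upper tail with the length of the lower tail, before and after taking `log10()`, using the minimum, median and maximum of residual.sugar from the output above. The helper `tail_ratio` is hypothetical and only meant as an illustration.

```r
# Minimum, median and maximum of residual.sugar, copied from the summary above
sugar <- c(min = 0.9, median = 2.2, max = 15.5)

# Hypothetical helper: upper-tail length divided by lower-tail length.
# A value near 1 would mean the two tails are about equally long.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(sugar)         # on the original scale
log_ratio <- tail_ratio(log10(sugar))  # on the log10 scale
round(c(raw = raw_ratio, log10 = log_ratio), 2)
```

The upper tail is much less dominant on the log10 scale, which is the same transformation that `scale_x_log10()` applies to the x axis of a ggplot2 histogram.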

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## 
## 0.012 0.034 0.038 0.039 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 
##     2     1     2     4     4     3     1     5     4     4     4     8 
## 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 0.059  0.06 
##     8    12     1    10     5    13     8     9    10    14    17    16 
## 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 0.071 0.072 
##    11    24    22    20    23    32    27    30    21    35    47    24 
## 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 0.083 0.084 
##    35    55    45    51    47    51    43    66    40    46    35    49 
## 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 0.095 0.096 
##    25    31    25    32    25    21    19    22    21    19    23    18 
## 0.097 0.098 0.099   0.1 0.101 0.102 0.103 0.104 0.105 0.106 0.107 0.108 
##    18    12     8    13     5    10     7    16     6     8     9     1 
## 0.109  0.11 0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118 0.119  0.12 
##     3     8     7     6     1    11     5     2     4     8     3     3 
## 0.121 0.122 0.123 0.124 0.125 0.126 0.127 0.128 0.132 0.136 0.137 0.143 
##     2     7     6     3     1     1     1     1     4     1     1     1 
## 0.145 0.146 0.147 0.148 0.152 0.153 0.157 0.159 0.161 0.165 0.166 0.168 
##     1     1     1     1     2     1     3     1     1     1     3     1 
## 0.169  0.17 0.171 0.172 0.174 0.176 0.178 0.186  0.19 0.194   0.2 0.205 
##     1     1     2     1     1     1     2     1     1     1     1     2 
## 0.213 0.214 0.216 0.222 0.226  0.23 0.235 0.236 0.241 0.243  0.25 0.263 
##     1     3     1     1     2     1     1     1     1     1     1     1 
## 0.267  0.27 0.332 0.337 0.341 0.343 0.358  0.36 0.368 0.369 0.387 0.401 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.403 0.413 0.414 0.415 0.422 0.464 0.467  0.61 0.611 
##     1     1     2     3     1     1     1     1     1

Again, with chlorides we see outliers to the right.

I transformed the long-tailed data to understand it better.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

###another transformation across the y axis

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Stacking not well defined when ymin != 0

total.sulfur.dioxide seems to be another factor that might have a negative effect on smell and taste, especially if it is over 50 ppm.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## 
##    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
##    3    4   14   14   27   26   29   28   33   35   26   27   35   29   33 
##   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35 
##   25   25   34   36   27   24   30   43   20   14   32   20   17   20   26 
##   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
##   12   26   31   16   17   14   26   18   23   20   17   24   21   21   11 
##   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   11   15   14   20   13   10    6   14    9   18    9    9   13   10   17 
##   66   67   68   69   70   71   72   73   74   75   76   77 77.5   78   79 
##    9   12   10    8    8    7   10    7    8    5    3    8    2    4    5 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##    4    6    4    2    6    9   10    6   14    9    5    7    8    2    8 
##   95   96   98   99  100  101  102  103  104  105  106  108  109  110  111 
##    4    5    7    6    3    4    6    2    5    5    6    3    4    6    3 
##  112  113  114  115  116  119  120  121  122  124  125  126  127  128  129 
##    3    4    2    2    1    7    2    4    3    3    2    1    2    2    3 
##  130  131  133  134  135  136  139  140  141  142  143  144  145  147  148 
##    1    3    3    2    2    2    1    1    3    1    2    3    3    3    2 
##  149  151  152  153  155  160  165  278  289 
##    1    2    1    1    1    1    1    1    1

The difference between the mean and the median is larger than for many other variables: the median is 38 and the mean is 46.47. There are only 9 samples between 150 and 289.
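One common rule of thumb for flagging outliers is Tukey's fence: values more than 1.5 times the interquartile range above the third quartile. It is not used in the analysis above and is offered only as a sketch. Using the quartiles printed in the summary for total.sulfur.dioxide:

```r
# Quartiles copied from the summary output above.
q1 <- 22
q3 <- 62

iqr <- q3 - q1                   # interquartile range
upper_fence <- q3 + 1.5 * iqr    # Tukey's upper fence

print(upper_fence)
```

This gives a fence of 122, so the rule would flag more values than just the 9 samples at 150 or above.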

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## 
##    1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
##    3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
##   30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
##   16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
##   43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
##    3    3    1    1    4    2    4    3    1    1    2    1    1    2    1

Again, the data is skewed for free.sulfur.dioxide, and I have to apply a log transformation in order to see the distribution. The mean is 15.87 and the median is 14.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## 
## 0.33 0.37 0.39  0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##    1    2    6    4    5    8   16   12   18   19   29   31   27   26   47 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##   51   68   50   60   55   68   51   69   45   61   48   46   41   42   36 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   35   23   33   26   28   26   26   20   25   26   23   18   19   15   22 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 
##   15   13   14   13   13    7    7    8    8    5   10    4    2    3    6 
## 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09  1.1 1.11 1.12 
##    2    3    1    1    3    2    2    3    4    2    3    1    2    1    1 
## 1.13 1.14 1.15 1.16 1.17 1.18  1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56 
##    2    2    1    1    5    3    1    1    1    2    1    1    1    3    1 
## 1.59 1.61 1.62 1.95 1.98    2 
##    1    1    1    2    1    1

The distribution of sulphates is also right-skewed. There are outliers, but the differences between the quartiles are not as stark.

## [1] 1599
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040
## 
## 0.99007  0.9902 0.99064  0.9908 0.99084  0.9912  0.9915 0.99154 0.99157 
##       2       1       2       1       1       1       1       1       1 
##  0.9916 0.99162  0.9917 0.99182 0.99191  0.9921  0.9922 0.99235 0.99236 
##       2       1       1       2       1       1       2       1       1 
##  0.9924 0.99242 0.99252 0.99256 0.99258 0.99264  0.9927  0.9928 0.99286 
##       3       2       1       1       3       1       1       2       1 
##  0.9929 0.99292 0.99294 0.99306 0.99314 0.99316 0.99318  0.9932 0.99322 
##       1       1       2       1       1       2       1       1       1 
## 0.99323 0.99328  0.9933 0.99331 0.99332 0.99334 0.99336  0.9934 0.99341 
##       1       1       1       2       1       1       1       4       1 
## 0.99344 0.99346 0.99348  0.9935 0.99352 0.99354 0.99356 0.99357 0.99358 
##       1       3       1       1       2       2       4       1       3 
##  0.9936 0.99362 0.99364  0.9937 0.99371 0.99374 0.99376 0.99378 0.99379 
##       2       2       1       2       2       2       3       3       1 
##  0.9938 0.99384 0.99385 0.99386 0.99387 0.99388 0.99392 0.99394 0.99395 
##       1       1       1       1       1       2       2       1       1 
## 0.99396 0.99397   0.994 0.99402 0.99408  0.9941 0.99414 0.99416 0.99417 
##       3       1       2       4       3       1       2       1       1 
## 0.99418 0.99419  0.9942 0.99425 0.99426 0.99428  0.9943 0.99434 0.99437 
##       2       2       3       1       1       1       2       1       1 
## 0.99438 0.99439  0.9944 0.99444 0.99448 0.99451 0.99454 0.99456 0.99458 
##       5       1       3       4       4       1       1       1       4 
## 0.99459  0.9946 0.99462 0.99464 0.99467 0.99468  0.9947 0.99471 0.99472 
##       1       5       2       2       2       1       6       3       3 
## 0.99473 0.99474 0.99476 0.99478 0.99479  0.9948 0.99483 0.99484 0.99486 
##       1       1       3       2       1       9       1       3       1 
## 0.99488 0.99489  0.9949 0.99491 0.99492 0.99494 0.99495 0.99496 0.99498 
##       4       3       4       1       2       4       2       1       5 
## 0.99499   0.995 0.99501 0.99502 0.99504 0.99506 0.99508 0.99509  0.9951 
##       1      10       1       2       2       1       3       1       4 
## 0.99512 0.99514 0.99516 0.99517 0.99518 0.99519  0.9952 0.99521 0.99522 
##       2       5       6       1       3       1       9       1       4 
## 0.99523 0.99524 0.99525 0.99526 0.99528 0.99529  0.9953 0.99531 0.99532 
##       1       4       2       2       3       1       4       2       1 
## 0.99533 0.99534 0.99536 0.99538  0.9954 0.99541 0.99542 0.99543 0.99544 
##       1       6       2      11       4       1       1       2       1 
## 0.99545 0.99546 0.99547 0.99549  0.9955 0.99551 0.99552 0.99553 0.99554 
##       3       7       2       2      14       3       5       1       3 
## 0.99555 0.99556 0.99557 0.99558  0.9956 0.99562 0.99564 0.99565 0.99566 
##       1       2       3       3      14       4       2       3       4 
## 0.99568 0.99569  0.9957 0.99572 0.99573 0.99574 0.99575 0.99576 0.99577 
##       4       1       6       9       1       2       2       5       3 
## 0.99578  0.9958 0.99581 0.99582 0.99584 0.99585 0.99586 0.99587 0.99588 
##       3      14       1       1       2       3       6       2       4 
## 0.99589  0.9959 0.99592 0.99593 0.99594 0.99596 0.99598 0.99599   0.996 
##       1      13       4       2       1       2       2       2      13 
## 0.99603 0.99604 0.99605 0.99606 0.99608 0.99609  0.9961 0.99612 0.99613 
##       2       3       3       2       2       1      10       6       4 
## 0.99614 0.99615 0.99616 0.99617 0.99619  0.9962 0.99621 0.99622 0.99623 
##       2       5       7       1       1      28       1       5       2 
## 0.99624 0.99625 0.99627 0.99628 0.99629  0.9963 0.99631 0.99632 0.99633 
##       3       3       3       3       2      15       1       4       4 
## 0.99634 0.99635 0.99636 0.99638 0.99639  0.9964 0.99641 0.99642 0.99643 
##       3       1       5       5       2      25       1       3       1 
## 0.99645 0.99646 0.99647 0.99648 0.99649  0.9965 0.99651 0.99652 0.99654 
##       1       1       2       3       1      11       1       6       2 
## 0.99655 0.99656 0.99658 0.99659  0.9966 0.99661 0.99664 0.99665 0.99666 
##       6       5       1       2      23       1       3       1       3 
## 0.99667 0.99668 0.99669  0.9967 0.99672 0.99674 0.99675 0.99676 0.99677 
##       1       4       2      13       5       2       5       3       2 
## 0.99678  0.9968 0.99682 0.99683 0.99684 0.99685 0.99686 0.99688 0.99689 
##       1      35       2       2       1       8       3       2       4 
##  0.9969 0.99692 0.99693 0.99694 0.99695 0.99697 0.99698 0.99699   0.997 
##      18       4       2       3       1       1       1       1      24 
## 0.99701 0.99702 0.99704 0.99705 0.99706 0.99708 0.99709  0.9971 0.99712 
##       2       4       3       1       2       4       1      13       4 
## 0.99713 0.99714 0.99716 0.99717 0.99718 0.99719  0.9972 0.99721 0.99722 
##       2       2       2       1       3       1      36       1       1 
## 0.99724 0.99725 0.99726 0.99727 0.99728 0.99729  0.9973 0.99732 0.99733 
##       4       1       1       1       3       1      18       3       1 
## 0.99734 0.99735 0.99736 0.99738 0.99739  0.9974 0.99743 0.99744 0.99745 
##       4       6       5       4       1      22       2       2       9 
## 0.99746 0.99747 0.99748  0.9975 0.99752 0.99754 0.99756 0.99758  0.9976 
##       7       2       3       7       1       1       1       1      35 
## 0.99761 0.99764 0.99765 0.99768 0.99769  0.9977 0.99772 0.99774 0.99779 
##       1       1       1       3       2       4       1       5       1 
##  0.9978 0.99782 0.99783 0.99784 0.99785 0.99786 0.99787 0.99788  0.9979 
##      26       2       2       1       1       4       3       2      14 
## 0.99791 0.99796 0.99798   0.998 0.99801 0.99803 0.99808  0.9981 0.99814 
##       1       1       2      29       2       3       1      10       2 
## 0.99815 0.99817 0.99818  0.9982 0.99822 0.99823 0.99824 0.99828  0.9983 
##       2       2       3      23       1       1       3       2       9 
## 0.99832 0.99834 0.99836  0.9984 0.99842 0.99845  0.9985 0.99852 0.99854 
##       1       1       2      20       2       1       3       1       1 
## 0.99855 0.99859  0.9986 0.99864 0.99865  0.9987 0.99878  0.9988 0.99888 
##       2       1      19       1       2      12       1      20       2 
##  0.9989 0.99892   0.999 0.99901  0.9991 0.99914 0.99915 0.99918  0.9992 
##       2       3       8       1      10       3       1       1       7 
## 0.99922 0.99925  0.9993 0.99935 0.99938 0.99939  0.9994  0.9995  0.9996 
##       1       1       4       1       1       1      24       1      12 
## 0.99965  0.9997 0.99974 0.99975 0.99976  0.9998  0.9999       1 1.00005 
##       1       8       1       1       1      10       1      10       2 
##  1.0001 1.00012 1.00015  1.0002 1.00024 1.00025  1.0003  1.0004  1.0006 
##       4       1       2      10       1       1       2       9       6 
##  1.0008   1.001  1.0014  1.0015  1.0018  1.0021  1.0022 1.00242  1.0026 
##       3       6       6       2       1       2       2       2       2 
## 1.00289 1.00315  1.0032 1.00369 
##       1       3       1       2

The distribution of density is approximately normal, with the first quartile, median, mean and third quartile very close to each other.

Univariate analysis

What is the structure of your dataset?

There are 1599 observations (red wine samples) in the dataset, described by 11 input features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) plus the quality score; the column X is only a row index. Most of the features, except for density, pH and quality, are right-skewed and have some extreme outliers to the right.

other observations:

As for quality, most wines were rated above average, since the median is larger than the mean. For most other variables the median is below the mean. This is most notable for total.sulfur.dioxide, which becomes evident in smell and taste above 50 ppm: its median is substantially below the mean, and still 25% of wines have over 62 ppm. For most attributes, except density, pH and to some extent alcohol, the spread across the quartiles is wide, especially between the minimum and the maximum, which may be due to outliers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in my dataset is quality. I would like to know which features affected the experts' quality ratings. I suspect total.sulfur.dioxide, residual.sugar, volatile.acidity and citric.acid have the most effect.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

total.sulfur.dioxide, residual.sugar, volatile.acidity and citric.acid are the features I am most interested in, but looking into other features, or a combination of some of them, might help in investigating the dataset effectively and building a model.

Did you create any new variables from existing variables in the dataset?

I created a new feature called total.acidity which is the sum of fixed.acidity and volatile.acidity.

I will have to examine if it has any connection to the quality and if it improves building a model.
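As a minimal sketch of how such a derived column can be built, the snippet below sums the two acidity columns for the first five rows shown in the structure listing at the top of the report. The data frame name `wines_head` is hypothetical.

```r
# First five rows of the two columns, copied from the structure listing.
wines_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70)
)

# New feature: element-wise sum of the two acidity columns.
wines_head$total.acidity <- wines_head$fixed.acidity + wines_head$volatile.acidity

print(wines_head)
```

The same one-line assignment applied to the full data frame would add the column for all 1599 rows.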

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the features were right-skewed, and I log-transformed them to get a better sense of the data. For total.sulfur.dioxide this was done on the y axis, and for residual.sugar it was done on both axes separately, as it is right-skewed and also has a wide range of outliers.

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.26848392     -0.008815099
## fixed.acidity        -0.268483920    1.00000000     -0.256130895
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                      citric.acid residual.sugar    chlorides
## X                    -0.15355136   -0.031260835 -0.119868519
## fixed.acidity         0.67170343    0.114776724  0.093705186
## volatile.acidity     -0.55249568    0.001917882  0.061297772
## citric.acid           1.00000000    0.143577162  0.203822914
## residual.sugar        0.14357716    1.000000000  0.055609535
## chlorides             0.20382291    0.055609535  1.000000000
## free.sulfur.dioxide  -0.06097813    0.187048995  0.005562147
## total.sulfur.dioxide  0.03553302    0.203027882  0.047400468
## density               0.36494718    0.355283371  0.200632327
## pH                   -0.54190414   -0.085652422 -0.265026131
## sulphates             0.31277004    0.005527121  0.371260481
## alcohol               0.10990325    0.042075437 -0.221140545
## quality               0.22637251    0.013731637 -0.128906560
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                            0.090479643          -0.11784967 -0.36837209
## fixed.acidity               -0.153794193          -0.11318144  0.66804729
## volatile.acidity            -0.010503827           0.07647000  0.02202623
## citric.acid                 -0.060978129           0.03553302  0.36494718
## residual.sugar               0.187048995           0.20302788  0.35528337
## chlorides                    0.005562147           0.04740047  0.20063233
## free.sulfur.dioxide          1.000000000           0.66766645 -0.02194583
## total.sulfur.dioxide         0.667666450           1.00000000  0.07126948
## density                     -0.021945831           0.07126948  1.00000000
## pH                           0.070377499          -0.06649456 -0.34169933
## sulphates                    0.051657572           0.04294684  0.14850641
## alcohol                     -0.069408354          -0.20565394 -0.49617977
## quality                     -0.050656057          -0.18510029 -0.17491923
##                               pH    sulphates     alcohol     quality
## X                     0.13600533 -0.125306999  0.24512284  0.06645261
## fixed.acidity        -0.68297819  0.183005664 -0.06166827  0.12405165
## volatile.acidity      0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid          -0.54190414  0.312770044  0.10990325  0.22637251
## residual.sugar       -0.08565242  0.005527121  0.04207544  0.01373164
## chlorides            -0.26502613  0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.07037750  0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456  0.042946836 -0.20565394 -0.18510029
## density              -0.34169933  0.148506412 -0.49617977 -0.17491923
## pH                    1.00000000 -0.196647602  0.20563251 -0.05773139
## sulphates            -0.19664760  1.000000000  0.09359475  0.25139708
## alcohol               0.20563251  0.093594750  1.00000000  0.47616632
## quality              -0.05773139  0.251397079  0.47616632  1.00000000

For quality, the strongest positive correlation is with alcohol (0.48), with weaker positive correlations with sulphates (0.25) and citric acid (0.23). There is a negative correlation between quality and volatile acidity (-0.39) and weak negative correlations with total sulfur dioxide (-0.19) and chlorides (-0.13). There is a strong positive correlation between density and fixed acidity (0.67) and a strong negative one between pH and fixed acidity (-0.68).
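A matrix like the one above is what `cor()` returns for a numeric data frame (assuming the default Pearson method was used). A quick way to read off the features most related to quality is to take that column and sort it by absolute value. The sketch below uses only the first ten rows of three columns from the structure listing, so its numbers will not match the full-data matrix; `wines_head` is a hypothetical name.

```r
# First ten rows copied from the structure listing at the top of the report.
wines_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix (the default method of cor()).
cm <- cor(wines_head)

# Correlations with quality, excluding quality itself, strongest first.
q <- cm[rownames(cm) != "quality", "quality"]
ranked <- q[order(abs(q), decreasing = TRUE)]
print(round(ranked, 3))
```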

Using scatterplots to see the relationship between quality and alcohol, pH, density, citric acid and some other features.

Vertical strips show that the values of quality are discrete integers.

## Warning: Removed 49 rows containing missing values (geom_point).

As citric acid increases, the variation in fixed acidity increases. The relation between the two seems to be linear.

Above we can see the linear relation between the two variables more clearly and also the increase of variation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## Warning: position_stack requires non-overlapping x intervals

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

It seems that most of the higher-quality wines have a higher level of citric acid: the median rises from 0.035 at quality 3 to 0.42 at quality 8.
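The per-quality summaries above have the shape that base R's `by()` prints when `summary` is applied to one column split by another. A minimal sketch on a hypothetical six-row data frame (values invented for illustration, not rows from the wine data):

```r
# Hypothetical toy data; not rows from the wine dataset.
toy <- data.frame(
  quality     = c(3, 3, 5, 5, 8, 8),
  citric.acid = c(0.00, 0.04, 0.20, 0.26, 0.40, 0.44)
)

# Summary of citric.acid within each quality level.
print(by(toy$citric.acid, toy$quality, summary))

# The medians alone, one per quality level, are easier to compare.
med <- tapply(toy$citric.acid, toy$quality, median)
print(med)
```

Pulling out just the medians with `tapply()` makes a trend across quality levels easier to spot than scanning six full summaries.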

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## Warning: position_stack requires non-overlapping x intervals

There is a correlation between the amount of alcohol and quality, and there is no low-alcohol wine with high quality. Let’s see them in numbers.

## Warning: Removed 1 rows containing non-finite values (stat_summary).

## 
##              9.2              9.5              9.7              9.8 
##                2                2                2                2 
##              9.9               10             10.1             10.2 
##                4                9                2                4 
##             10.3             10.4             10.5            10.55 
##                1                1               10                1 
##             10.6             10.7             10.8             10.9 
##                6                1               11                5 
##               11             11.1             11.2             11.3 
##               13                4               10                8 
##             11.4             11.5             11.6             11.7 
##                3                6                6               13 
##             11.8             11.9               12             12.1 
##               11                5                9                8 
##             12.2             12.3             12.4             12.5 
##                4                7                6               10 
##             12.6             12.7             12.8             12.9 
##                3                3                8                4 
##               13             13.1             13.3             13.4 
##                2                1                1                2 
## 13.5666666666667             13.6               14 
##                1                3                3

Wines with higher alcohol usually have higher quality.

Most wines have a quality of 5 or 6.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The highest-quality wines (8) have the highest median alcohol (12.15). The lowest-quality wines (3) have a low median (9.925), but the lowest median of all belongs to the wines scored 5 (9.7).

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.162   3.230   3.267   3.350   3.720

I see a weak trend towards more acidic (lower pH) wines having a higher quality score. Although the correlation is very weak (-0.06), the median pH is lowest for wines with quality 8 (3.23) and highest for wines with quality 3 (3.39).

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

I see a relatively stronger correlation between volatile acidity and quality (a negative one): the median volatile acidity falls from 0.845 at quality 3 to 0.37 at qualities 7 and 8.

There seems to be a positive correlation between the two variables citric acid and density, but their correlations with quality have opposite signs.

## Warning: Removed 162 rows containing non-finite values (stat_smooth).
## Warning: Removed 162 rows containing missing values (geom_point).

The relationship between citric.acid and density seems to be linear, but it is weak and the data points are very dispersed (there is a lot of variation).

## 
## Calls:
## m1: lm(formula = density ~ citric.acid, data = subset(wines, citric.acid > 
##     0 & citric.acid <= quantile(wines$citric.acid, 0.999)))
## 
## =============================
##   (Intercept)      0.996***  
##                   (0.000)    
##   citric.acid      0.004***  
##                   (0.000)    
## -----------------------------
##   R-squared            0.1   
##   adj. R-squared       0.1   
##   sigma                0.0   
##   F                  210.8   
##   p                    0.0   
##   Log-likelihood    7229.2   
##   Deviance             0.0   
##   AIC             -14452.3   
##   BIC             -14436.5   
##   N                 1465     
## =============================

The model that uses citric.acid to explain density explains only about 10% of the variance, which is very little.

There is a correlation between density and fixed.acidity: the higher the fixed.acidity, the higher the density.

## 
## Calls:
## m2: lm(formula = quality ~ alcohol, data = subset(wines, alcohol > 
##     0 & alcohol <= quantile(wines$alcohol, 0.999)))
## 
## =============================
##   (Intercept)      1.818***  
##                   (0.175)    
##   alcohol          0.366***  
##                   (0.017)    
## -----------------------------
##   R-squared            0.2   
##   adj. R-squared       0.2   
##   sigma                0.7   
##   F                  480.4   
##   p                    0.0   
##   Log-likelihood   -1715.4   
##   Deviance           800.7   
##   AIC               3436.8   
##   BIC               3452.9   
##   N                 1598     
## =============================

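The correlation between alcohol and quality is about 0.48, so the model explains only about 20% of the variance in quality: for a linear model with a single predictor, R-squared is the square of the correlation.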
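The link between a correlation and the R-squared of a one-predictor linear model can be checked directly. The sketch below uses simulated data (the coefficients and noise level are arbitrary choices, not estimates from the wine data) and then squares the alcohol and quality correlation printed in the matrix earlier.

```r
# Simulated data standing in for alcohol and quality; not the wine data.
set.seed(1)
x <- rnorm(200, mean = 10.4, sd = 1.1)
y <- 1.8 + 0.37 * x + rnorm(200, sd = 0.7)

r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared

# The two quantities agree up to floating-point error.
print(c(cor_squared = r^2, r_squared = r2))

# Squaring the alcohol/quality correlation reported in the matrix above.
print(0.47616632^2)
```

The last line comes out at roughly 0.23, in line with the rounded R-squared of 0.2 in the model table.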

## 
## Calls:
## m3: lm(formula = quality ~ volatile.acidity, data = wines)
## 
## ===============================
##   (Intercept)        6.566***  
##                     (0.058)    
##   volatile.acidity  -1.761***  
##                     (0.104)    
## -------------------------------
##   R-squared              0.2   
##   adj. R-squared         0.2   
##   sigma                  0.7   
##   F                    287.4   
##   p                      0.0   
##   Log-likelihood     -1794.3   
##   Deviance             883.2   
##   AIC                 3594.6   
##   BIC                 3610.8   
##   N                   1599     
## ===============================

Again only a small share of the variance is explained: R-squared is reported as 0.2, and the correlation of -0.39 implies roughly 0.15. Perhaps I should add more features to the model in the next part.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a moderate negative correlation between quality and volatile.acidity (-0.39), a stronger positive one between quality and alcohol (0.48), and weaker ones with sulphates, citric.acid and density.

Wines with higher amounts of citric acid, alcohol and sulphates are likelier to have a higher quality, while the correlation with volatile.acidity is negative.

Most wines have a quality of 5 or 6 (80-90%).

Wines with higher citric acid seem to have a higher density.

Using R-squared to explain the variance in quality based on a single feature does not give a good result. In the next section I will use more than one feature and see if there is any improvement.
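As a sketch of what comparing a one-feature and a two-feature model might look like, the snippet below fits both to simulated data. The coefficients are arbitrary and the data frame `sim` is hypothetical; none of the numbers are results from the wine data.

```r
# Simulated data; the coefficients are arbitrary, not estimates from the wine data.
set.seed(7)
n <- 300
alcohol          <- rnorm(n, mean = 10.4, sd = 1.1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality <- 3 + 0.3 * alcohol - 1.5 * volatile.acidity + rnorm(n, sd = 0.6)
sim <- data.frame(alcohol, volatile.acidity, quality)

m_one <- lm(quality ~ alcohol, data = sim)
m_two <- lm(quality ~ alcohol + volatile.acidity, data = sim)

# R-squared never decreases when a predictor is added to a nested model,
# so adjusted R-squared is the fairer comparison.
print(c(one = summary(m_one)$r.squared,     two = summary(m_two)$r.squared))
print(c(one = summary(m_one)$adj.r.squared, two = summary(m_two)$adj.r.squared))
```

Because plain R-squared cannot go down when a feature is added, an increase in adjusted R-squared is the better sign that the extra feature is pulling its weight.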

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a correlation between free.sulfur.dioxide and total.sulfur.dioxide, which is understandable because one is a subset of the other. There is also one between citric.acid and density, and an even stronger one between density and fixed.acidity.

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and pH: the higher the fixed.acidity, the lower the pH. There is also a strong correlation between density and fixed.acidity. None of them has a very strong relationship with quality.

Multivariate Plots Section:

I made the second plot with only the highest and lowest quality levels to make the distinction clearer. The first plot covers all quality levels. Comparing the lowest quality with the highest, it seems that for the same amount of sulphates the higher-quality wines have a lower pH.

As expected with alcohol, the higher the alcohol for the same amount of sulfate the quality seems to be higher.

The general trend seem to be for wines with higher volatile.acidity seem to have lower quality. this corresponds with the corrolation results. we can see for higer qualities higher alcohol seem to be compensating for higher volatile.acidity.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04975 0.35250 1.56300 3.21400 5.94000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.270   0.882   1.757   2.700   9.400 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.950   2.185   2.412   3.572  10.270 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.972   2.764   2.923   4.654   9.112 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.310   4.720   4.288   5.685   9.880 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.420   3.375   5.116   4.624   6.160   8.978

The product of the two positively correlated features seems to show their effect on quality more clearly.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.360   5.171   5.320   5.637   5.681   8.514 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.003   4.753   5.871   6.092   6.612  18.800 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.705   5.130   5.723   6.145   6.615  19.400 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.171   5.985   6.763   7.171   8.033  19.300 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.563   7.326   8.378   8.493   9.564  13.560 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.150   8.045   9.322   9.257  10.520  11.480

The correlation appears here as well.

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00245 0.01930 0.11220 0.23840 0.37620 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0144  0.0456  0.1290  0.1408  2.0000 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0510  0.1292  0.1602  0.2310  0.9576 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0567  0.1680  0.1925  0.2965  0.9044 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2008  0.2976  0.2837  0.3893  0.7344 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0246  0.2164  0.3424  0.2966  0.3675  0.5904

The median of the product of sulphates and citric.acid differs many-fold between quality 3 and quality 8 (0.0193 versus 0.3424, roughly 18 times).
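Per-quality summaries of a feature product, in the same layout as the output above, can be produced with `by()`. This is a minimal sketch on a synthetic data frame `toy` (made-up values, not the wine data); the column name `sulphates_citric` is a hypothetical name introduced here.

```r
# Synthetic stand-in for the wine data (values are made up for illustration).
set.seed(3)
n <- 120
toy <- data.frame(
  quality     = sample(3:8, n, replace = TRUE),
  sulphates   = runif(n, 0.3, 1.2),
  citric.acid = runif(n, 0, 0.8)
)

# Product of the two features, one value per wine (hypothetical column name).
toy$sulphates_citric <- toy$sulphates * toy$citric.acid

# Min, quartiles, median, mean and max of the product within each quality level.
print(by(toy$sulphates_citric, toy$quality, summary))

# Median per quality level, and the ratio of the top level's median to the
# bottom level's median.
med <- tapply(toy$sulphates_citric, toy$quality, median)
fold <- unname(med[length(med)] / med[1])
print(round(fold, 2))
```

Because the toy features are generated independently of quality, the ratio here should be close to 1. On real data a ratio far from 1 is what signals a difference between the lowest and highest quality levels.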

As I expected, higher-quality wines tend to have higher alcohol, which overshadows the very weak effect of higher sulphates.

The general trend seems to be that higher sulphates, lower volatile.acidity and higher alcohol go with higher quality.

## 
## Calls:
## m4: lm(formula = quality ~ alcohol, data = wines)
## m5: lm(formula = quality ~ alcohol + sulphates, data = wines)
## m6: lm(formula = quality ~ alcohol + sulphates + citric.acid, data = wines)
## m7: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity, 
##     data = wines)
## m8: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density, data = wines)
## m9: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity, data = wines)
## m10: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides, data = wines)
## m11: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar, 
##     data = wines)
## m12: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar + 
##     total.sulfur.dioxide, data = wines)
## m13: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity + 
##     density + volatile.acidity + chlorides + residual.sugar + 
##     total.sulfur.dioxide + pH, data = wines)
## 
## ============================================================================================================================================
##                            m4         m5         m6         m7          m8          m9         m10         m11         m12         m13      
## --------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            1.875***   1.375***   1.434***   1.138***   62.356***   30.401*     31.514*     42.871*     47.404**    25.493     
##                         (0.175)    (0.177)    (0.176)    (0.214)    (15.472)    (15.163)    (15.111)    (17.775)    (17.729)    (21.142)    
##   alcohol                0.361***   0.346***   0.338***   0.346***    0.296***    0.298***    0.281***    0.271***    0.249***    0.275***  
##                         (0.017)    (0.016)    (0.016)    (0.016)     (0.021)     (0.020)     (0.020)     (0.022)     (0.023)     (0.026)    
##   sulphates                         0.994***   0.814***   0.821***    0.881***    0.732***    0.885***    0.905***    0.955***    0.929***  
##                                    (0.102)    (0.107)    (0.106)     (0.107)     (0.104)     (0.112)     (0.113)     (0.114)     (0.114)    
##   citric.acid                                  0.513***   0.312*      0.278*     -0.460***   -0.325*     -0.344*     -0.215      -0.231     
##                                               (0.093)    (0.125)     (0.125)     (0.137)     (0.141)     (0.142)     (0.145)     (0.145)    
##   fixed.acidity                                           0.033*      0.076***    0.077***    0.070***    0.078***    0.066***    0.031     
##                                                          (0.013)     (0.017)     (0.017)     (0.017)     (0.018)     (0.018)     (0.026)    
##   density                                                           -61.296***  -28.268     -29.221     -40.621*    -44.859*    -21.594     
##                                                                     (15.490)    (15.198)    (15.145)    (17.822)    (17.772)    (21.575)    
##   volatile.acidity                                                               -1.302***   -1.195***   -1.192***   -1.125***   -1.124***  
##                                                                                  (0.116)     (0.119)     (0.119)     (0.120)     (0.120)    
##   chlorides                                                                                  -1.444***   -1.470***   -1.646***   -1.825***  
##                                                                                              (0.408)     (0.408)     (0.409)     (0.419)    
##   residual.sugar                                                                                          0.017       0.029*      0.020     
##                                                                                                          (0.014)     (0.014)     (0.015)    
##   total.sulfur.dioxide                                                                                               -0.002***   -0.002***  
##                                                                                                                      (0.001)     (0.001)    
##   pH                                                                                                                             -0.361     
##                                                                                                                                  (0.190)    
## --------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.3        0.3        0.3        0.3         0.3         0.4         0.4         0.4         0.4    
##   adj. R-squared             0.2        0.3        0.3        0.3        0.3         0.3         0.3         0.3         0.4         0.4    
##   sigma                      0.7        0.7        0.7        0.7        0.7         0.7         0.7         0.7         0.6         0.6    
##   F                        468.3      295.0      210.5      159.8      132.2       140.0       122.6       107.5        98.2        88.9    
##   p                          0.0        0.0        0.0        0.0        0.0         0.0         0.0         0.0         0.0         0.0    
##   Log-likelihood         -1721.1    -1675.1    -1660.0    -1657.0    -1649.2     -1587.9     -1581.6     -1580.9     -1573.0     -1571.2    
##   Deviance                 805.9      760.9      746.6      743.9      736.6       682.2       676.9       676.3       669.6       668.1    
##   AIC                     3448.1     3358.3     3329.9     3326.1     3312.5      3191.8      3181.2      3181.8      3168.0      3166.3    
##   BIC                     3464.2     3379.8     3356.8     3358.4     3350.1      3234.8      3229.6      3235.5      3227.1      3230.9    
##   N                       1599       1599       1599       1599       1599        1599        1599        1599        1599        1599      
## ============================================================================================================================================

This seems to be a poor model. The maximum R-squared reached, even with many features included, is 0.4.
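Plain R-squared can never decrease when a predictor is added to a nested linear model, so adjusted R-squared, AIC and BIC are more useful for judging whether an extra feature earns its place. In the table above, for example, AIC falls slightly from m12 to m13 (3168.0 to 3166.3) while BIC rises (3227.1 to 3230.9). The sketch below shows one way to tabulate these statistics in base R. The data frame `toy` and the model names `m_a`, `m_b`, `m_c` are made up for illustration and are not the models fitted above.

```r
# Synthetic stand-in for the wine data (values are made up for illustration).
set.seed(4)
n <- 300
toy <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  sulphates        = runif(n, 0.3, 1.2),
  volatile.acidity = runif(n, 0.1, 1.2)
)
toy$quality <- 2 + 0.3 * toy$alcohol + 0.8 * toy$sulphates -
  1.1 * toy$volatile.acidity + rnorm(n, sd = 0.7)

# Nested models: each one adds a single predictor to the previous one.
models <- list(
  m_a = lm(quality ~ alcohol, data = toy),
  m_b = lm(quality ~ alcohol + sulphates, data = toy),
  m_c = lm(quality ~ alcohol + sulphates + volatile.acidity, data = toy)
)

# One row per model with fit statistics that do and do not penalise size.
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
print(round(comparison, 3))
```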

Multivariate Analysis:

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There are obvious correlations between pH and fixed.acidity, and between free.sulfur.dioxide and total.sulfur.dioxide.

Although there is a weak correlation between quality and sulphates, citric.acid and chlorides, many data points do not seem to follow any relationship between these features. For instance, there are a lot of fluctuations in the line plot of sulphates vs. alcohol for the different qualities.

Were there any interesting or surprising interactions between features?

There were a couple of them, notably the relationship between alcohol and density. Wines with higher alcohol seem to have lower density on average, and there is a very weak negative correlation between density and quality.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model of quality on alcohol. Alcohol alone explained only 0.2 of the variance in quality.

By adding further features, the R-squared was raised to 0.4.

Final Plots and Summary

Plot One

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

description 1:

The majority of samples have a quality of 5 or 6, and density seems to be approximately normally distributed.

Plot two:

## Warning: position_stack requires non-overlapping x intervals

Description 2:

The largest correlation is seen between density and volatile.acidity. Higher alcohol seems to correspond to higher quality as well. pH does not seem to have a strong numerical correlation with quality, but the box plot suggests that wines assessed at quality 7-8 have a lower median pH than those assessed at quality 3-4.

Plot three:

Description 3:

Faceting the wines by quality, filling by alcohol and using volatile.acidity as the x axis shows that the higher qualities contain more wines with higher alcohol, and that counts for wines with higher alcohol are generally higher. It also shows that wines with higher quality have lower volatile acidity. Among wines with a quality of 3 there is no sample with alcohol higher than 11.

Reflection:

>>>>>>> origin/master